We present a large-scale comparative study of self-supervised speech representation (S3R)-based voice conversion (VC). In the context of recognition-synthesis VC, S3Rs are attractive owing to their potential to replace expensive supervised representations, such as phonetic posteriorgrams (PPGs), that are adopted by state-of-the-art VC systems. Using S3PRL-VC, a previously developed open-source VC toolkit, we provide a series of in-depth objective and subjective analyses under three VC settings: intra-/cross-lingual any-to-one (A2O) and any-to-any (A2A) VC, using the Voice Conversion Challenge 2020 (VCC2020) dataset. We examine S3R-based VC from various aspects, including model type, multilinguality, and supervision. We also study the effect of a post-discretization process with k-means clustering and show the improvement it brings in the A2A setting. Finally, a comparison with state-of-the-art VC systems demonstrates the competitiveness of S3R-based VC and sheds light on possible directions for improvement.
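To make the post-discretization step concrete, here is a minimal sketch (an illustrative assumption, not the S3PRL-VC implementation): fit a k-means codebook on frame-level S3R features and map new frames to discrete unit IDs. `extract_s3r_features` is a hypothetical helper standing in for any S3PRL upstream model.

```python
# Illustrative sketch: k-means discretization of frame-level S3R features.
import numpy as np
from sklearn.cluster import KMeans

def fit_codebook(train_features, n_clusters=100):
    """train_features: list of [T_i, D] arrays of S3R frames from training utterances."""
    frames = np.concatenate(train_features, axis=0)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frames)

def discretize(features, codebook):
    """Map each frame to its nearest cluster index, yielding a discrete unit sequence."""
    return codebook.predict(features)  # [T] integer unit IDs

# Usage sketch (extract_s3r_features is hypothetical):
# feats = [extract_s3r_features(wav) for wav in training_wavs]
# codebook = fit_codebook(feats, n_clusters=100)
# units = discretize(extract_s3r_features(test_wav), codebook)
```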
Deep learning based models have significantly improved the performance of speech separation on cocktail-party-like input mixtures. Prominent methods (e.g., frequency-domain and time-domain speech separation) typically build regression models that predict the ground-truth speech from the mixture, using a masking-based design and a signal-level loss criterion (e.g., MSE or SI-SNR). This study demonstrates, for the first time, that a synthesis-based approach can also perform well on this problem, with great flexibility and strong potential. Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols, and convert the paradigm of speech separation/enhancement tasks into classification. By utilizing a synthesis model that takes discrete symbols as input, each target speech can be re-synthesized after the discrete symbol sequence is predicted. Evaluation results on the WSJ0-2mix and VCTK-noisy corpora under various settings show that our proposed method can steadily synthesize the separated speech with high quality and without any interference, which is difficult to avoid in regression-based methods. In addition, with negligible loss of listening quality, speaker conversion of the enhanced/separated speech can easily be realized with our method.
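To illustrate the classification view, the sketch below predicts per-source discrete-unit sequences from mixture features with a cross-entropy loss. The tokenizer, model sizes, and unit vocabulary are assumptions; permutation-invariant training and the re-synthesis stage are omitted.

```python
# Illustrative sketch: separation as discrete-unit classification (not the paper's code).
import torch
import torch.nn as nn

class UnitClassifier(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_units=512, n_speakers=2):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # One classification head per output source.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n_units) for _ in range(n_speakers)])

    def forward(self, mixture_feats):              # [B, T, feat_dim]
        h, _ = self.encoder(mixture_feats)
        return [head(h) for head in self.heads]    # each: [B, T, n_units] logits

def classification_loss(logits_per_src, target_units_per_src):
    """Cross-entropy over discrete units for each source (no signal-level loss)."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(lg.transpose(1, 2), tgt)         # [B, n_units, T] vs [B, T]
               for lg, tgt in zip(logits_per_src, target_units_per_src))
```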
This work presents a self-supervised method for learning dense, semantically rich visual concept embeddings for images, inspired by methods for learning word embeddings in NLP. Our method improves on existing work by producing more expressive embeddings and by being applicable to high-resolution images. Viewing the generation of natural images as a stochastic process in which a set of latent visual concepts gives rise to observable pixel appearances, our method is formulated to learn the inverse mapping from pixels to concepts. It greatly improves the effectiveness of self-supervised learning of dense embedding maps by introducing superpixelization as a natural hierarchical step up from pixels to a small set of visually coherent regions. Additional contributions are regional contextual masking with nonuniform shapes matching visually coherent patches, and complexity-based view sampling inspired by masked language models. The enhanced expressiveness of our dense embeddings is demonstrated by significantly improving the state of the art on representation quality benchmarks on COCO (+12.94 mIoU, +87.6%) and Cityscapes (+16.52 mIoU, +134.2%). The results also show favorable scaling and domain generalization properties that prior work has not demonstrated.
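The superpixelization step can be sketched as pooling dense per-pixel embeddings into a small set of region embeddings over SLIC superpixels; the choice of SLIC and of mean pooling are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: aggregate per-pixel embeddings over visually coherent superpixels.
import numpy as np
from skimage.segmentation import slic

def region_embeddings(image, pixel_embeddings, n_segments=64):
    """image: [H, W, 3] float array; pixel_embeddings: [H, W, D] dense embeddings."""
    segments = slic(image, n_segments=n_segments, start_label=0)   # [H, W] region labels
    regions = []
    for s in np.unique(segments):
        mask = segments == s
        regions.append(pixel_embeddings[mask].mean(axis=0))        # mean-pool a region
    return segments, np.stack(regions)                             # [n_regions, D]
```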
This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains the major architecture of this software platform, several important functionalities that differentiate ESPnet from other open source ASR toolkits, and experimental results on major ASR benchmarks.
Classification bandits are multi-armed bandit problems whose task is to classify a given set of arms into a positive or a negative class, depending on whether the proportion of arms with expected reward of at least h is no less than w, for given thresholds h and w. We study a special classification bandit problem in which arms correspond to points x in d-dimensional real space with expected rewards f(x) generated according to a Gaussian process prior. We develop a framework algorithm for the problem that admits various arm selection policies, and propose policies called FCB and FTSV. We show a smaller sample complexity upper bound for FCB than that of the existing level set estimation algorithm, in which whether f(x) is at least h must be decided for every arm's point x. Arm selection policies that depend on an estimated rate of arms with rewards of at least h are also proposed and shown to improve empirical sample complexity. According to our experimental results, the rate-estimation versions of FCB and FTSV, together with that of the popular active learning policy that selects the point with the maximum variance, outperform the other policies on synthetic functions, and the rate-estimation version of FTSV is also the best performer on our real-world dataset.
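The abstract does not spell out FCB or FTSV, so the following is only an assumption-based sketch of the general framework it describes: fit a Gaussian process to the observed rewards, label any arm whose confidence interval lies entirely above or below h, and keep pulling the most uncertain undecided arm.

```python
# Illustrative GP-based thresholding sketch; the confidence-bound rule is an assumption.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def classify_arms(arms, pull, h, beta=2.0, budget=200):
    """arms: [n, d] candidate points; pull(i) returns a noisy reward for arm i."""
    arms = np.asarray(arms, dtype=float)
    X, y, labels = [], [], {}
    undecided = set(range(len(arms)))
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2)
    for _ in range(budget):
        if not X:
            i = int(np.random.choice(list(undecided)))      # seed with one random pull
        else:
            gp.fit(np.array(X), np.array(y))
            mu, sd = gp.predict(arms, return_std=True)
            for j in list(undecided):                       # decide arms with clear intervals
                if mu[j] - beta * sd[j] >= h:
                    labels[j] = True; undecided.discard(j)
                elif mu[j] + beta * sd[j] <= h:
                    labels[j] = False; undecided.discard(j)
            if not undecided:
                break
            i = max(undecided, key=lambda j: sd[j])          # pull the most uncertain arm
        X.append(arms[i]); y.append(pull(i))
    return labels, undecided
```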
Diagnostic radiologists need artificial intelligence (AI) for medical imaging, but access to medical images required for training in AI has become increasingly restrictive. To release and use medical images, we need an algorithm that can simultaneously protect privacy and preserve pathologies in medical images. To develop such an algorithm, here, we propose DP-GLOW, a hybrid of a local differential privacy (LDP) algorithm and one of the flow-based deep generative models (GLOW). By applying a GLOW model, we disentangle the pixelwise correlation of images, which makes it difficult to protect privacy with straightforward LDP algorithms for images. Specifically, we map images onto the latent vector of the GLOW model, each element of which follows an independent normal distribution, and we apply the Laplace mechanism to the latent vector. Moreover, we applied DP-GLOW to chest X-ray images to generate LDP images while preserving pathologies.
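The core privatization step can be sketched as follows; `glow.encode`/`glow.decode` are hypothetical placeholders for a trained GLOW model, and the sensitivity used for the noise scale is an assumption for illustration.

```python
# Sketch: Laplace mechanism applied to the (approximately independent) GLOW latent vector.
import numpy as np

def laplace_mechanism(z, epsilon, sensitivity):
    """Add i.i.d. Laplace noise with scale sensitivity/epsilon to latent vector z."""
    scale = sensitivity / epsilon
    return z + np.random.laplace(loc=0.0, scale=scale, size=z.shape)

def privatize_image(image, glow, epsilon, sensitivity):
    z = glow.encode(image)                        # hypothetical flow forward pass
    z_noisy = laplace_mechanism(z, epsilon, sensitivity)
    return glow.decode(z_noisy)                   # hypothetical inverse pass back to pixels
```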
Fisher's criterion is a widely used tool in machine learning for feature selection. For large search spaces, Fisher's criterion can provide a scalable solution to select features. A challenging limitation of Fisher's criterion, however, is that it performs poorly when the mean values of class-conditional distributions are close to each other. Motivated by this challenge, we propose an extension of Fisher's criterion to overcome this limitation. The proposed extension utilizes the available heteroscedasticity of class-conditional distributions to distinguish one class from another. Additionally, we describe how our theoretical results can be cast into a neural network framework, and conduct a proof-of-concept experiment to demonstrate the viability of our approach to solving classification problems.
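A small sketch of the classical two-class Fisher score makes the stated limitation visible: with coinciding class means the score collapses toward zero even when the class variances differ strongly, which is exactly the heteroscedastic information the proposed extension exploits (the extension itself is not reproduced here).

```python
# Sketch of the classical Fisher score and its failure mode when class means coincide.
import numpy as np

def fisher_score(x_pos, x_neg):
    """x_pos, x_neg: 1-D arrays of one feature's values for the two classes."""
    mu1, mu2 = x_pos.mean(), x_neg.mean()
    v1, v2 = x_pos.var(), x_neg.var()
    return (mu1 - mu2) ** 2 / (v1 + v2 + 1e-12)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 0.1, 1000)     # tight class
b = rng.normal(0.0, 3.0, 1000)     # wide class with the same mean
print(fisher_score(a, b))          # ~0, although the classes differ strongly in variance
```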
Text-to-speech synthesis (TTS) is a task to convert texts into speech. Two of the factors that have been driving TTS are the advancements of probabilistic models and latent representation learning. We propose a TTS method based on latent variable conversion using a diffusion probabilistic model and the variational autoencoder (VAE). In our TTS method, we use a waveform model based on VAE, a diffusion model that predicts the distribution of latent variables in the waveform model from texts, and an alignment model that learns alignments between the text and speech latent sequences. Our method integrates diffusion with VAE by modeling both mean and variance parameters with diffusion, where the target distribution is determined by approximation from VAE. This latent variable conversion framework potentially enables us to flexibly incorporate various latent feature extractors. Our experiments show that our method is robust to linguistic labels with poor orthography and alignment errors.
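As a rough sketch of the text-conditioned diffusion prior over VAE latents, the code below uses a standard epsilon-prediction DDPM objective; the paper's joint modeling of mean and variance, the alignment model, and all module sizes are assumptions or omissions, not the authors' implementation.

```python
# Minimal sketch: DDPM-style prior over VAE latent frames, conditioned on a text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class LatentDenoiser(nn.Module):
    """Predicts the noise added to a VAE latent frame, given text conditioning."""
    def __init__(self, latent_dim=64, text_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, text_emb, t):
        t_emb = (t.float() / T).unsqueeze(-1)        # crude timestep embedding
        return self.net(torch.cat([z_t, text_emb, t_emb], dim=-1))

def diffusion_loss(model, z0, text_emb):
    """Standard DDPM objective on aligned VAE latents z0 [B, latent_dim]."""
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    eps = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps   # forward noising
    return F.mse_loss(model(z_t, text_emb, t), eps)
```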
End-to-end text-to-speech synthesis (TTS) can generate highly natural synthetic speech from raw text. However, rendering the correct pitch accents is still a challenging problem for end-to-end TTS. To tackle the challenge of rendering correct pitch accents in Japanese end-to-end TTS, we adopt PnG BERT, a self-supervised pretrained model in the character and phoneme domain for TTS. We investigate the effects of features captured by PnG BERT on Japanese TTS by modifying the fine-tuning condition to determine which conditions are helpful for inferring pitch accents. We manipulate the content of PnG BERT features from text-oriented to speech-oriented by changing the number of fine-tuned layers during TTS training. In addition, we teach PnG BERT pitch accent information by fine-tuning with tone prediction as an additional downstream task. Our experimental results show that the features captured by PnG BERT during pretraining contain information helpful for inferring pitch accents, and that PnG BERT outperforms the baseline Tacotron on accent correctness in a listening test.
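A minimal sketch of the auxiliary tone-prediction objective follows; the generic PnG BERT-style encoder, the tone inventory size, the simplified decoder, and the loss weight are all assumptions, not the released training setup.

```python
# Sketch: fine-tuning with tone prediction as an additional downstream task.
import torch
import torch.nn as nn

class AccentAwareTTS(nn.Module):
    def __init__(self, encoder, hidden=768, n_tones=3, n_mels=80):
        super().__init__()
        self.encoder = encoder                       # pretrained PnG BERT-style encoder (assumed)
        self.tone_head = nn.Linear(hidden, n_tones)  # auxiliary tone classifier
        self.decoder = nn.Linear(hidden, n_mels)     # stand-in for a Tacotron-style decoder

    def forward(self, token_ids):
        h = self.encoder(token_ids)                  # [B, T, hidden]
        return self.decoder(h), self.tone_head(h)

def multitask_loss(mel_pred, mel_tgt, tone_logits, tone_tgt, tone_weight=0.1):
    tts_loss = nn.functional.l1_loss(mel_pred, mel_tgt)
    tone_loss = nn.functional.cross_entropy(
        tone_logits.transpose(1, 2), tone_tgt)       # [B, n_tones, T] vs [B, T]
    return tts_loss + tone_weight * tone_loss
```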
Computer vision applications have heavily relied on the linear combination of Lambertian diffuse and microfacet specular reflection models for representing reflected radiance, which turns out to be physically incompatible and limited in applicability. In this paper, we derive a novel analytical reflectance model, which we refer to as the Fresnel Microfacet BRDF (FMBRDF) model, that is physically accurate and generalizes to various real-world surfaces. Our key idea is to model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, both for body and surface reflections. We carefully derive the Fresnel reflection and transmission for each microfacet as well as the light transport between them in the subsurface. This physically-grounded modeling also allows us to express the polarimetric behavior of reflected light in addition to its radiometric behavior. That is, FMBRDF unifies not only body and surface reflections but also light reflection in radiometry and polarization, and represents them in a single model. Experimental results demonstrate its effectiveness in accuracy, expressive power, and image-based estimation.
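For background, the per-facet building block is the standard unpolarized Fresnel reflectance at a dielectric interface; the sketch below is textbook Fresnel, not the FMBRDF derivation itself, and the refractive indices are example values.

```python
# Sketch: exact Fresnel reflectance for unpolarized light at a dielectric interface.
import numpy as np

def fresnel_reflectance(cos_theta_i, n1=1.0, n2=1.5):
    """cos_theta_i: cosine of the incidence angle; n1, n2: refractive indices."""
    sin_theta_t = (n1 / n2) * np.sqrt(np.clip(1.0 - cos_theta_i**2, 0.0, 1.0))  # Snell's law
    if np.any(sin_theta_t >= 1.0):                   # total internal reflection
        return np.ones_like(np.asarray(cos_theta_i, dtype=float))
    cos_theta_t = np.sqrt(1.0 - sin_theta_t**2)
    r_s = (n1 * cos_theta_i - n2 * cos_theta_t) / (n1 * cos_theta_i + n2 * cos_theta_t)
    r_p = (n1 * cos_theta_t - n2 * cos_theta_i) / (n1 * cos_theta_t + n2 * cos_theta_i)
    return 0.5 * (r_s**2 + r_p**2)                   # average of s- and p-polarized reflectance

# e.g. fresnel_reflectance(np.cos(np.deg2rad(0.0)))  # ~0.04 at normal incidence for n2 = 1.5
```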